Releases: huggingface/optimum-habana
v1.17.0: Transformers v4.49
Transformers v4.49
This release has been tested and validated for Transformers v4.49 and SynapseAI v1.20.
Model optimizations
- Use token_idx_cpu int instead of token_idx tensor in slicing #1848 @jaygala223
- Keep logits in bf16 #1835 @jaygala223
- Optimize SD3 pipeline: pad prompt embeddings for softmax_hf8 compatibility and efficient utilization #1816 @deepak-gowda-narayana
- Add G3 perf WA for Qwen2VL #1884 @nngokhale
- Fix MPT regression #1857 @atakaha
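As a rough illustration of the #1848 change (a sketch, not the actual modeling code): slicing with a plain Python int instead of a 0-dim device tensor keeps the slice static and avoids a host-device round-trip on every decode step.

```python
# Illustrative only: why a Python int index is cheaper than a tensor index.
import torch

logits = torch.randn(4, 128, 32000)  # [batch, seq, vocab] stand-in

token_idx = torch.tensor(17)  # 0-dim tensor index: can force syncs/recompiles
token_idx_cpu = 17            # plain int: static, free slicing

next_token_logits = logits[:, token_idx_cpu - 1, :]
```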
Tests and CI
- Slow test updates #1804 @ugolowic
- Fix race condition when downloading nltk tokenizer #1802 @ugolowic
- fea(): Skipped the torch_fx tests #1797 @imangohari1
- Upstream tests #1834 @IlyasMoutawwakil
- test_examples: add missing clip-roberta baseline #1852 @uartie
- Separate slow tests by required number of cards #1803 @ugolowic
- Update PR doc build workflow #1904 @regisss
Other
- Disable HPU migration (future add-on to HF diffusers) for OH diffusers #1866 @dsocek
- Allow explicit control over flash_attention_fast_softmax setting #1851 @astachowiczhabana
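#1851 above makes the fast-softmax path of fused flash attention an explicit choice. A minimal sketch of driving it through the Gaudi generation config (model and values illustrative; attribute names assumed to match the text-generation example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi
from optimum.habana.transformers.generation import GaudiGenerationConfig

adapt_transformers_to_gaudi()  # patch Transformers with Gaudi-optimized paths

name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("hpu")
tokenizer = AutoTokenizer.from_pretrained(name)

gen_config = GaudiGenerationConfig(
    max_new_tokens=32,
    use_flash_attention=True,             # fused SDPA kernel
    flash_attention_fast_softmax=False,   # explicit control added in #1851
)
inputs = tokenizer("Hello", return_tensors="pt").to("hpu")
print(tokenizer.decode(model.generate(**inputs, generation_config=gen_config)[0]))
```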
v1.16.0: Deepseek V3, SynapseAI v1.20, Llama 405b, AWQ
SynapseAI v1.20
This release has been tested and validated for SynapseAI v1.20.
New models
- Add Qwen2-VL #1542 @nngokhale
- Add video-llava model support #1522 @kaixuanliu
- Enable the i2vgen pipeline #1670 @yuanwu2017
- DeepSeek_v3 support #1735 @srajabos
Llama 405b
- Enable Llama 3.1 405B in FP8 #1745 @jaygala223
- v1.16 Llama3-405B text-generation. Added DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API flag. #1812 @dsmertin
- Revert placing llama on cpu #1827 @ugolowic
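The determinism flag added in #1812 is an environment variable; as a sketch, it would be set before DeepSpeed initializes (the exact consumer of the flag is assumed):

```python
import os

# Set before deepspeed / habana_frameworks are imported (assumption).
os.environ["DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API"] = "1"
```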
AWQ
- Enable awq int4 in Gaudi #1691 @sywangyi
- Fix dependency issue with --load_quantized_model_with_autoawq #1759 @schoi-habana
Various model optimizations
- Optimizations and WAs to support HPU execution for Detr-Resnet-50 #1334 @sandeep-maddipatla
- Optimized DeepSeek-v2 on Gaudi #1677 @gyou2021
- Add xlm-roberta model support for tei-gaudi use case #1715 @kaixuanliu
- Optimized SD3 pipeline #1682 @deepak-gowda-narayana
- Add clear hpu cache flag for stable perf #1634 @jaygala223
- Fix graph breaks in Mixtral #1705 @ShengYang1
- Add batch splitting in attention layer to hide NIC latency #1640 @kalyank007
- Fix llama FP8 perf issue, kvcache.update should be used since FP8 patches KVCache #1756 @sywangyi
- Add HPU fp8 Dynamic MOE #1761 @dudilester
Sentence Transformers
CI
Other
- Fixed formatting #1693 @imangohari1
- Fix FLUX.1_dev guidance_batches bug for pad case in _split_inputs_into_batches #1607 @huijuanzh
- Fix peft error in Gaudi1 #1627 @sywangyi
- Update README.md #1678 @skaulintel
- Fix custom ops loading in diffusers #1655 @dsocek
- Fix ddpo finetune issue in torch2.5.1 #1666 @sywangyi
- Adding Deepspeed zero1 config #1675 @bhargaveede
- Enable warmup also for full prompt length case in text generation #1676 @yeonsily
- Add padding to input for mllama/paligemma/idefics2 #1671 @sywangyi
- Fix for Mixtral G1 pytest failures #1652 @12010486
- Fix textual_inversion_sdxl failure on docker 1.20 #1697 @atakaha
- Updated Encoder_decoder Tests #1688 @slokesha
- Add checks for parallel_state initialization #1680 @yafshar
- Update the readme to remove validated models #1703 @jiminha
- FP8 baichuan-13b gets OOM when running lm_eval @Liangyx2
- Lm eval upgraded to 0.4.7 #1692 @12010486
- Enable attention selection in wav2vec-ac #1713 @ugolowic
- Fix bug when preparing quant files, starcoder model does not support #1672 @kaixuanliu
- Update training pytests to reduce total time #1712 @jiminha
- Dropping some ci tests from image_to_text and text_generation #1710 @hsubramony
- Add save_checkpoint arg for TIMM training to simplify validation #1701 @ZhengHongming888
- Added Unit Test for Gemma-2-27b model #1616 @slokesha
- Update TRL README.md to clean up models #1706 @shepark
- Support regional compilation #1618 @chaojun-zhang (see the torch.compile sketch at the end of this list)
- Fix text generation quality for bf16 models when sampling #1644 @skavulya
- Readme modification #1700 @libinta
- Fix mpt model generation #1696 @mengniwang95
- Fix lm_eval issue of llama #1606 @sywangyi
- Align diffusers CI tests with examples #1679 @dsocek
- Update audio-classification/requirements.txt to fix numpy version #1717 @hsubramony
- Improve automation for stable-diffusion training scripts in README #1651 @dsocek
- Fix video diffusion black output if --bf16 is set #1685 @sywangyi
- Fix sdxl mlperf time bug #1580 @huijuanzh
- Enabling minimize memory for zero3 runs #1724 @bhargaveede
- Add gated models to diffusers CI tests #1690 @dsocek
- Fix formatting of the kubeVersion range in Kubernetes helm chart #1733 @dmsuehir
- Fix llava/llava next issue when working with AutoProcessor #1674 @sywangyi
- fea(): reworked the 8x hpu skipping strategy #1694 @imangohari1
- Process getting killed while loading data for Llama3.2 90b, 8x #1723 @kalyank007
- Fix: Adjust recipe to fit within QueueComputeScal HBM global memory size limit #1722 @kalyank007
- Add PRC models to test_text_generation_example.py #1695 @wenbinc-Bin
- Added quant config files for new scenarios #1681 @ulivne
- Update README.md - correction in diffusers example #1742 @ramyij
- Update DS config to align with recommended settings #1730 @ckvermaAI
- Add dynamo cache size limit option #1619 @chaojun-zhang
- Resolve 'NoneType' object has no attribute 'gate_proj' error when applying EP in DeepSeek-V2 #1740 @IT-Forrest
- Edit mixtral quantization config file #1739 @dudilester
- Fix the incorrect output of sdxl inpaint #1737 @yuanwu2017
- Supports Bitsandbytes development on HPU #1714 @rsshaik1
- FLAN-T5 has bad performance when using regional compilation #1744 @chaojun-zhang
- Add batch dim idx to support latest deepspeed DistributedAttention #1725 @bhargaveede
- Add the inline_inbuilt_nn_modules option #1617 @chaojun-zhang
- Clean up README examples #1709 @yeonsily
- Accuracy fix for llama3.1-70B in eager/torch.compile mode #1746 @ckvermaAI
- Adjust baselines for a lower number of epochs (improved perplexity, lower throughput) #1748 @emascarenhas
- Change clip-roberta/bridgetower not to use fast_ddp #1749 @jiminha
- Adds requirements.txt to sentence transformers training paraphrases #1753 @pi314ever
- Add requirements.txt to sentence transformer training sts #1754 @pi314ever
- Add diffuser tests for optimized sdxl flow on HPU #1554 @sushildubey171
- Fix the output length in image_to_text test #1751 @sywangyi
- Fix Experts Indexing in MoE for Mixtral: Align experts_max with Number of Available Experts #1755 @deepak-gowda-narayana
- Add requirements.txt to sentence transformers nli example #1767 @pi314ever
- UX code change #1764 @talexjohn
- Enable saving and loading FP8 model #1683 @xin3he
- Update measurements for Stable Diffusion XL #1773 @mkrze
- Add datasets to the requirements for Stable Diffusion training #1782 @yafshar
- Enable wav2vec-large model for speech_recognition test #1783 @jiminha
- Update multi-node-training environment variables for GaudiNIC #1779 @Jianhong-Zhang
- Fixed Gemma2 error when saving pretrain #1781 @kplau1128
- Support llava1.5 lora finetuning. #1487 @lkk12014402
- Fix DeepSeek-V2 expert-parallelism crash due to indexing error #1765 @skavulya
- Update transformer_engine._convert_model to skip LoRA layers #1766 @vivekgoe
- Create Habana_Validated_Models.md to list all the models validated #1778 @hsubramony
- Enable attention selection for wav2vec2 #1757 @ugolowic
- Add --attn_implementation to wav2vec2 slow tests #1788 @ugolowic
- Add sentencepiece to the requirements #1792 @hsubramony
- Fix LoRA weights loading in text-to-image generation sample script #1789 @dsocek
- Add trust_remote_code #1786 @atakaha
- Fix the restart issue for Sentence Transformer STS example in validation #1799 @ZhengHongming888
- Experimental flags for accuracy issues #1795 @hsubramony
- Temporary WA for get_type error #1806 @12010486
- Fix Sentence Transformer STS restart issue #1814 @ZhengHongming888
- Fix broken link for GenerationConfig #1819 @xin3he
- Fix for text-generation, AttributeError: 'GenerationConfig' object has no attribute 'use_fused_rope' #1823 @hsubramony
- Fix dataset_version for ST example requirements.txt #1809 @ZhengHongming888
- Move model to device before wrapping with FSDP #1830 @skaulintel
- Update warmup ratio for adalora #1820 @astachowiczhabana
- Fix for attention selection in wav2vec2 #1836 @ugolowic
- Revert "Lm eval upgraded to 0.4.7 (#1692)" #1837 @astachowiczhabana
- Removing HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM as it's not needed from 1.20 #1726 @bhargaveede
- Temporary workaround to avoid segmentation fault #1798 @yafshar
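The regional compilation (#1618) and dynamo cache-size (#1619) options above lend themselves to a short illustration. A generic sketch, not the repository's exact implementation (layer layout assumed Llama-like, values illustrative):

```python
import torch

torch._dynamo.config.cache_size_limit = 64  # the knob surfaced by #1619

def compile_regions(model):
    # Regional compilation: compile each repeated decoder block separately
    # instead of the whole model, for smaller graphs and faster recompiles.
    for i, block in enumerate(model.model.layers):
        model.model.layers[i] = torch.compile(block, backend="hpu_backend")
    return model
```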
v1.15.0: SynapseAI v1.19.0, FLUX, Mllama, DeepSeek, Falcon 3
SynapseAI v1.19
FLUX
- FLUX with diffusers 0.31.0 #1450 @dsocek
- FLUX Fine-Tuning for Gaudi #1482 @dsocek
- Flux Image-To-Image pipeline #1524 @dsocek
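The FLUX pipelines follow the usual Gaudi diffusers pattern (use_habana, use_hpu_graphs, gaudi_config). A sketch, assuming the wrapper class is exposed as GaudiFluxPipeline:

```python
import torch
from optimum.habana.diffusers import GaudiFluxPipeline  # class name assumed

pipe = GaudiFluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)
image = pipe("a cat wearing a space suit", num_inference_steps=28).images[0]
```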
New models
- Optimized inference of Cohere model on HPU #1329 @XinyuYe-Intel
- Idefics2 #1270 @sywangyi
- Optimized inference of XGLM model on HPU #1323 @XinyuYe-Intel
- Add mllama support #1419 @sywangyi
- Enable paligemma model for image-to-text example #1407 @kaixuanliu
- Enable Gemma2 Inference on Gaudi #1504 @Luca-Calabria
- Minicpm enabling #1342 @pi314ever
- Enable Falcon-mamba #1480 @yuanwu2017
- Add support for Baichuan2 #1479 @xhaihao
- Enable DeepSeek-V2 #1475 @yao-matrix
- Add chatglm #1478 @mengker33
- Falcon Model Support #1612 @alekseyfa
Various model optimizations
- Enable flash attention for gemma #1454 @atakaha
- Support loading 4 bit Qwen2 #1476 @mengniwang95
- Fixed Gemma FP8 flash_attention lower throughput issue #1510 @kplau1128
- Disable default sdpa in Albert (#22) #1517 @astachowiczhabana
- Implement fused sdpa for wav2vec2 (#18) #1520 @astachowiczhabana
- Memory optimization for gpt_bitcode #1513 @astachowiczhabana
- Support beam search with reuse_cache and bucket_internal #1472 @Wei-Lin-Intel
- Add mixtral trl sft #1349 @lkk12014402
- Enable tiiuae/falcon-11B-vlm in image_to_text example #1490 @sywangyi
- Enable fusedsdpa kernel for vision part of mllama #1531 @sywangyi
- Enable dynamic compile for mpi(training) #1509 @chaojun-zhang
- Add DynamicMoE support for Mixtral #1511 @kwisniewski98
- Implemented fusedSDPA for stable diffusion (#36) #1545 @astachowiczhabana
- Fix Accuracy Calculation Issue in GPT-NeoX #1591 @yafshar
Sentence Transformers
- Update sentence transformer to v3.2.1 #1470 @ZhengHongming888
Textual Inversion XL
TIMM
- Enable pyTorch-IMage-Models (TIMM) with HPUs #1459 @ZhengHongming888
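Running a TIMM model on HPU mostly amounts to importing the Habana PyTorch bridge and moving the model to the hpu device. A sketch (model name illustrative):

```python
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
import timm

model = timm.create_model("resnet50", pretrained=True).to("hpu").eval()
x = torch.randn(1, 3, 224, 224).to("hpu")
with torch.no_grad():
    out = model(x)
htcore.mark_step()  # flush lazy-mode execution
print(out.shape)
```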
Context Parallelism
- Adding support for Context Parallelism using DeepSpeed's DistributedAttention #1501 @bhargaveede
- Move parallel_state.py to the distributed folder a6ee7c2044e6ddf7d19ae3ad663149e51d6f89e7 @regisss
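#1501 wires DeepSpeed's Ulysses-style DistributedAttention into the attention layers. A rough sketch of the wrapping; the sequence-parallel process group is assumed to come from the relocated parallel_state helpers:

```python
from deepspeed.sequence.layer import DistributedAttention

def wrap_with_context_parallel(core_attention, seq_parallel_group):
    # Each rank holds a slice of the sequence; DistributedAttention all-to-alls
    # activations so every head still attends over the full sequence.
    return DistributedAttention(core_attention, seq_parallel_group)
```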
CI improvements
- Tests for text gen output text #1411 @vidyasiv
- Add split runners to CI (2 devices per runner for fast tests) 72df37df46d1d2a2665c5d1be43b13704b7c8ada @regisss
- Fix fast CI to work with split runners #1534 @regisss
- Add Llama 3.1 ft to CI #1529 @MohitIntel
Documentation
Other
- Fix facebook/hf-seamless-m4t-medium crash #1433 @sywangyi
- Fix bias update in scoped all reduce #1456 @skavulya
- fea(pytests): Added skip for unsupported tests for mistral/mixtral #1462 @imangohari1
- Remove deprecated Mixed precision flags #1471 @vivekgoe
- Readme: replace tabs with spaces #1485 @mgonchar
- Move fast tests to Gaudi2 #1498 @regisss
- Remove torch req from LM example #1491 @astachowiczhabana
- Remove keep_input_mutations #1492 @astachowiczhabana
- Fix trust_remote_code #1493 @astachowiczhabana
- Upgrade ViT README with torch.compile #1494 @astachowiczhabana
- Corrected Throughput measure for GaudiDDPMPipeline #1460 @deepak-gowda-narayana
- [SW-196761] Add G3 in T5-L README #1523 @astachowiczhabana
- Fix tuple object error #1354 @SupreetSinghPalne
- Add warmup time and compile time log for the eval/prediction. #1489 @jiminha
- Add support for MLPERF optimized pipeline from example #1465 @ANSHUMAN87
- Add check_neural_compressor_min_version for 4 bit behavior #1500 @xin3he
- Pass "lazy_mode" arg to GaudiLlamaModel GaudiTrainer #1515 @astachowiczhabana
- Removed workaround for NaN bug causing graph break. #1516 @astachowiczhabana
- text_generation: improve parameters check #1527 @mgonchar
- transformers: fixed some typos #1528 @mgonchar
- Make the profiler's with_stack option configurable #1497 @ranzhejiang
- Fix dtype issue with valid sequence length in torch.compile bs=1 #1532 @wszczurekhabana
- Migrate OH CLIP (roberta-clip) training to torch.compile #1507 @chaojun-zhang
- test_text_generation: fix non-Gaudi2 case #1530 @mgonchar
- text-generation: improve output printing #1486 @mgonchar
- Text-generation, model set-up: torch.compile for attributes instead of models' types #1452 @dsmertin
- Fix bridgetower example #1481 @astachowiczhabana
- Migrate OH Wave2Vec-AC training to torch.compile - README update #1537 @astachowiczhabana
- Migrate OH T5-large training to torch.compile #1506 @chaojun-zhang
- trainer: fixed spelling #1538 @mgonchar
- Create CI Eager/Lazy for Language Modeling #1448 @Luca-Calabria
- Fixes for llava-next test failures in 1.19 #1535 @tthakkal
- Refactor Qwen2 Family #1541 @Wei-Lin-Intel
- Add support for optimized SDXL pipeline #1519 @sushildubey171
- Add the checkout parameters of falcon-mamba pytest #1540 @yuanwu2017
- Avoid negative values in eval metrics #1533 @deepak-gowda-narayana
- Fix lm_eval script for starcoder and gemma #1463 @skavulya
- Add option to use bf16 in PT sdp (#5) #1514 @astachowiczhabana
- Fix tests.test_peft_inference failure #1543 @sywangyi
- Update lm_eval version #1473 @alexey-belyakov
- Fix bad import in Baichuan code #1547 @regisss
- Restore performance in generate #1546 @ugolowic
- Fix for llava models not generating text with test failures in 1.19 #1548 @tthakkal
- Refactor KV cache, Rope , reduce common code #1148 @abhilash1910
- Adjust Qwen2-7B test case #1551 @Wei-Lin-Intel
- [run_lm_eval.py] Fixed too many print dump json info #1553 @FocusLuo
- Fix for single_card llama7b and falcon40b CI errors #1549 @MohitIntel
- Apply --sdp_on_bf16 to image-to-text examples #1557 @schoi-habana
- Fix accuracy regression in Gemma #1556 @skavulya
- Fix FusedSDPA wrapper from TransformerEngine #1562 @pbielak
- Run albert-xxlarge-v1 CI as torch.compile mode #1563 @yeonsily
- Update README commands for the models to use --sdp_on_bf16 #1566 @yeonsily
- Minicpm patch #1567 @pi314ever
- Updated gemma_2b_it CI #1561 @Luca-Calabria
- Fixed Adalora Test for OH 1.15 #1564 @npiroozan
- Fixed LORACP Test for OH 1.15 #1568 @npiroozan
- Fix prefix llama ci failure #1570 @sywangyi
- Fix mllama test #1569 @sywangyi
- Fix lazy_mode assignment #1558 @vidyasiv
- Generation utils update (minor) #1468 @yafshar
- Style: removed tabs #1577 @mgonchar
- Enable num_return_sequences in beam search #1536 @mengker33
- gpt_bigcode: added internal bucketing fix #1526 @mgonchar
- Update the Gaudi trainer with transformers 4.45.2 #1398 @yafshar
- Revert "add check_neural_compressor_min_version for 4 bit behavior" #1578 @xin3he
- Revert PR #1473 #1582 @regisss
- Fixed spelling #1576 @mgonchar
- Update docs for baichuan2 training #1586 @xhaihao
- Add WA flag for falcon-180b to resolve text-gen critical reset error during tests #1590 @hchauhan123
- Update transformers tests generation util v4.45.2 #1441 @malkomes
- Limit position embeddings in inference #1598 @bhargaveede
- Verify model output is provided when check_output is enabled #1597 @vidyasiv
- Update README.md #1595 @skaulintel
- Fix scikit-learn to 1.5.2 to fix f1 evaluation crash in 1.6.0 #1596 @sywangyi
- Update language-modeling README file #1599 @vivekgoe
- Revert common KVCache not to check token_idx #1594 @jiminha
- Revert LlamaKVCache due to memory increase #1605 @jiminha
- Replace the UNET custom attention processors #1608 @yafshar
- Fix run_generation test commands for TRL out usage example #1621 @shepark
- Update sdp_on_bf16 option for ST example #1615 @ZhengHongming888
- Update save lora weights for diffusers with text_encoder_2 layers #1626 @skavulya
- Fix save_lora_weights in pipeline_utils.py #1643 @regisss
- Check rope_scaling attr #1609 @jiminha
- Skip certain tests for G1 with empty param list #1613 @hsubramony
- Revert "Update transformers tests generation util v4.45.2 (#1441)" #1614 @yeonsily
- Audio classification readme update #1604 @hsubramony
- Fix readme cmds for clip-roberta #1603 @hsubramony
- Add arbitrary scales #1625 @jiminha
- Modify Qwen2 TRL command to avoid OOM. #1630 @jiminha
- Fix distributed issue for ST Trainer #1649 @ZhengHongming888
- Fix distributed issue for timm #1653 @ZhengHongming888
- Refactor mixtral moe block. #1635 @lkk12014402
- Speech-recognition: downgrade datasets version #1646 @hsubramony
- Add sdp_on_bf16 to controlnet #1631 @skaulintel
- Quick fix for quantization/custom op list loading #1657 @dsocek
- Fix bug for GaudiMixtralAttentionLongSequence forward #1650 @kaixuanliu
v1.14.1: Patch release
- Enable DeepSpeed for image-to-text example #1455 @schoi-habana
- Fix bug when loading 4bit checkpoint quantized in INC #1447 @xin3he
- Fixes 'Tokenizer does not have padding token' introduced by #1444 for Llama3.1 #1457 @MohitIntel
Full Changelog: v1.14.0...v1.14.1
v1.14.0: Transformers v4.45, SynapseAI v1.18, Qwen2-MoE, text-to-video generation
Transformers v4.45
SynapseAI v1.18
Qwen2-MoE
Text-to-video generation
- Enabling Text to Video Diffusion Model Generation #1109 @pi314ever
- Porting Stable Video Diffusion ControlNet to HPU #1037 @wenbinc-Bin
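A sketch of the text-to-video flow from #1109, following the usual Gaudi diffusers pattern (the pipeline class name is assumed):

```python
import torch
from optimum.habana.diffusers import GaudiTextToVideoSDPipeline  # name assumed

pipe = GaudiTextToVideoSDPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.bfloat16,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)
frames = pipe("a panda playing guitar", num_inference_steps=25).frames
```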
Depth-to-image generation
- Depth to Image Generation #1175 @pi314ever
Model optimizations
- Enable FusedSDPA for Mpt #1101 @Jianhong-Zhang
- Mixtral fp8 #1269 @imangohari1
- Prevent Graph break in Llama when using flash attention #1301 @pramodkumar-habanalabs
- Boost SDXL speed with initialized schedule step reset #1284 @dsocek
- Improve MPT fp8 #1256 @atakaha
- Add Whisper static generation #1275 @Spycsh
- Gemma: enabled HPU Graphs and Flash Attention #1173 @dsmertin
- Recommend jemalloc for gpt-neox-20b 8x #1350 @hsubramony
- Optimized inference of GPT-NEO model on HPU #1319 @XinyuYe-Intel
- Fix graph breaks for BART in torch.compile mode. #1379 @astachowiczhabana
- Gpt_bigcode: added internal_bucketing support #1218 @mgonchar
- refine bucket_internal for mpt #1194 @Jing1Ling
- Qwen finetuning bucketing #1130 @ssarkar2
- Enable FusedSDPA fp8 in Llama FT #1388 @pbielak
- Added gemma specific fp8 quantization file #1445 @yeonsily
Intel Neural Compressor
- Enable INC for llava models and change softmax to use torch.nn.functional.softmax, as it is a module supported by INC #1325 @tthakkal
- Load INC GPTQ checkpoint & rename params #1364 @HolyFalafel
- Fix INC load weights compile error due to Transformers 4.45 upgrade #1421 @jiminha
Vera/LN-tuning
Other
- Add callable workflow to post comments when code quality check failed #1263 @regisss
- Fix failed code quality check comment workflow #1264 @regisss
- Accelerate Diffusers CI #1265 @regisss
- Add profiler to SD3 #1267 @atakaha
- Fix profiling step with device finish execution for text-generation #1283 @libinta
- Update FusedSDPA calling method as Gaudi documentation #1285 @yeonsily
- Switch failed code quality check comment to workflow_run #1297 @regisss
- Potential fix for the failed code quality check comment workflow #1299 @regisss
- Fix text-generation example lm_eval evaluation #1308 @changwangss
- Add section to README about Transformers development branch #1307 @regisss
- Fix eager mode in run_generation by removing graph logs #1231 @Vasud-ha
- Fix bug when running google/paligemma-3b-mix-224 #1279 @kaixuanliu
- Use native checkpointing under compile mode #1313 @xinyu-intel
- fixed fused_qkv object AttributeError due to 'LlamaConfig' #1203 @rkumar2patel
- Image to Image Generation Enabling #1196 @pi314ever
- Diffusers timing #1277 @imangohari1
- Fix eos issue in finetune/generation #1253 @sywangyi
- Update CI, tests and examples #1315 @regisss
- Fix Sentence Transformer HPU graphs for training with PEFT model #1320 @nngokhale
- Fix ZeroDivisionError in constrained beam search with static shapes #1317 @skavulya
- Update esmfold model not to use param_buffer_assignment #1324 @jiminha
- Falcon inference crash fix for falcon-40b model #1161 @yeonsily
- Add --use_kv_cache to image-to-text pipeline #1292 @KimBioInfoStudio
- Trl upgrade #1245 @sywangyi
- Fix uint4 url typo. #1340 @kding1
- Use eager attention for wav2vec2 #1333 @skaulintel
- Add _reorder_cache back to Llama for HPU #1233 @jiminha
- SDXL CI script throughput #1296 @imangohari1
- Add image so that transformers tests can run #1338 @skaulintel
- Fixes the no attribute error with the falcon multicard test #1344 @mounikamandava
- Add profiler to sdxl mlperf pipeline #1339 @Jianhong-Zhang
- Fix decoder only generation #948 @tjs-intel
- Upgrade gradient checkpointing #1347 @yafshar
- Run_generation example: fixed graph compilation statistics reporting #1352 @mgonchar
- Fix deepseeed crash with Sentence Transformer Trainer #1328 @nngokhale
- fea(ci): reduced slow test_diffusers timing. minor fixes #1330 @imangohari1
- Flash attn args for GaudiGemmaForCausalLM #1356 @kkoryun
- Transformer models generation supports user-provided input embeddings #1276 @zongwave
- Fixed the expected values after for img2img slice #1332 @imangohari1
- Gpt_big_code: make flash attention impl quantization friendly #1282 @mgonchar
- Fix OOM when inference with llama-3.1-70b #1302 @harborn
- Fix the conditional #1362 @yafshar
- Revert "use native checkpointing under compile mode" #1365 @xinyu-intel
- Remove repetitive pip install commands #1367 @MohitIntel
- Minor UX enhancement #1373 @MohitIntel
- Fix bug when running image-to-text example #1371 @kaixuanliu
- Gpt_bigcode: fixed wrong indentation #1376 @mgonchar
- Support for transformers without self.model to torch.compile #1380 @astachowiczhabana
- Only pass the use_kv_cache True to generator #1366 @yafshar
- Clean up the code and remove unnecessary class #1382 @yafshar
- Add the diffusers examples of inference Tech #1244 @yuanwu2017
- Enhance transformers test suite in Optimum-habana-4.43.4 Auto pr 07654de #1387 @rkumar2patel
- Enhance transformers test suite in Optimum-habana-4.43.4 (auto PR 8926a4b) #1386 @rkumar2patel
- Add README.md for Sentence transformer examples with HPU device #1355 @ZhengHongming888
- Change Falcon/GPT-Neox rotary embedding function to use seq_len for #1368 @yeonsily
- Enhance Optimum-habana as per transformers-4.43.4 #1381 @rkumar2patel
- CI fix - Install stable-diffusion reqs #1389 @vidyasiv
- Fix error caused by uninitialized attn_weights #1391 @hsubramony
- Replace flash attention flag #1393 @skaulintel
- Fix DeepSpeed CI on Gaudi2 #1395 @regisss
- Truncate the cached max seq len #1394 @astachowiczhabana
- Fix gpt-neox training accuracy issue. #1397 @yeonsily
- Simplify HQT config files #1219 @Tiefen-boop
- unify_measurements.py script support to unify PCQ 70B 8x #1322 @Yantom1
- Add misc. training args #1346 @SanityRemnants
- Add quantization config for low bs case #1377 @ulivne
- Remove HQT from OHF #1257 @Yantom1
- Valid sequence length for sdpa #1183 @ssarkar2
- Multiple fixes (dynamo graph break, qwen-moe, multicard) #1410 @ssarkar2
- Change the image path for transformers tests back to the correct location #1401 @skaulintel
- Fix Gaudi2 regression tests #1403 @regisss
- Reverting some of transformer pytest funcs/values #1399 @imangohari1
- Fix StarCoder2 inference #1405 @regisss
- Change the order for test_diffusers #1406 @hsubramony
- Fix llama model text generation error #1402 @zongwave
- Datasets downgrade version to 2.21.0 #1413 @hsubramony
- Update ci sentence_transformer.sh #1424 @ZhengHongming888
- Update language-modeling README.md, add trust_remote_code for flan-t5-xl #1422 @hsubramony
- Update unify_measurements.py support info #1425 @shepark
- Fix GPT_neox incorrect output with batch query #1358 @Jianhong-Zhang
- Fix text-to-image example #1429 @regisss
- Add flag to run inference with partial dataset #1420 @pramodkumar-habanalabs
- Add peft generation example #1427 @sywangyi
- Added missing allocate_kv_cache() call in CausalLM class #1431 @yeonsily
- Fix merge error and update text-to-speech readme #1436 @hsubramony
- Fix OOM error for code llama #1437 @jiminha
- Fix error on 4bit checkpoint load with run_lm_eval on TF4.45.2 #1439 @jiminha
- GPT2 torch.compile fix #1434 @dsmertin
- Update text-gen README.md to add auto-gptq fork install steps #1442 @hsubramony
- Fix scoped linear all-reduce for starcoder model #1432 @skavulya
- Fixed recursion error in SentenceTransformer #1428 @yafshar
- Fix Llama 3.1 generation #1444 @regisss
- Remove cache folder from image data folder #1446 @shepark
v1.13.2: Patch release
Llava(-next) improvements
This patch release adds multi-card support for Llava(-next) and enables users to turn on/off recomputing for flash attention.
- Llava: Added flash_attention_recompute arg to provide an option to enable/disable recompute #1278 @tthakkal
- Add the deepspeed injection_policy of mistral #1309 @yuanwu2017
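A sketch of the new toggle from #1278 (kwarg plumbing assumed; model and image illustrative):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()
name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(name)
model = LlavaForConditionalGeneration.from_pretrained(name, torch_dtype=torch.bfloat16).to("hpu")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text="USER: <image>\nWhat is shown? ASSISTANT:", images=image, return_tensors="pt").to("hpu")

out = model.generate(
    **inputs,
    max_new_tokens=64,
    use_flash_attention=True,
    flash_attention_recompute=True,  # set False to keep activations instead of recomputing
)
print(processor.decode(out[0], skip_special_tokens=True))
```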
Full Changelog: v1.13.1...v1.13.2
v1.13.1: Patch release
Fixed memory regressions
- Remove _expand_inputs_for_generation for greedy search (#1266) @libinta
- Fix memory regression for modeling llama (#1271) @libinta
FSDP
FSDP checkpoint saving is fixed.
Known limitations
- ESMFold does not work on Gaudi1, this will be fixed in a future version
Full Changelog: v1.13.0...v1.13.1
v1.13.0: Stable Diffusion 3, Sentence Transformers, SAM, DETR, Kubernetes example
SynapseAI 1.17
- Upgrade SynapseAI version to 1.17.0 #1217
Transformers 4.43
Diffusers 0.29
Stable Diffusion 3
Training with Sentence Transformers
- Enable Sentence Transformer Trainer with Gaudi #1111 @ZhengHongming888
Model optimizations
- Fix starcoder2 accuracy issue and optimize performance with fused rope #1095 @mandy-li
- Enable FusedRoPE using float32 for gpt-neox model #1104 @yeonsily
- Mamba initial enablement. #1122 @libinta
- Adding fused qkv support along with config #1102 @bhargaveede
- Enhance Qwen2 with fastsoftmax and bf16 RoPE and cache optimization #1087 @Zhiwei35
- Enable fp8 inference for Llava-Next and add Fused_SDPA #1120 @tthakkal
- Support bucket_internal for MPT #1137 @pk1d3v
- Enable Flash Attention (Fused SDPA) for Starcoder #1114 @abhilash1910
- gpt_bigcode: added FusedSDPA kernel #1138 @mgonchar
- Enable torch.compile for Granite20B #1185 @dvarshney-habana
- Refine use cache for mpt model #1158 @Jing1Ling
- GPT-J support reuse_cache #1094 @atakaha
- Use fast softmax only on prefill #1159 @jaygala223
- Starcoder2 : KVCache and flash attention (FusedSDPA) enablement #1149 @abhatkal
- Gpt bigcode fused sdpa #1260 @yeonsily
SAM, FastVIT, VideoMAE, OpenCLIP, DETR, Table Transformer, deciLM
- Add an example of Segment Anything Model [Inference] #814 @cfgfung
- Add an example of FastViT model (Inference) #826 @cfgfung
- VideoMAE Model Enabling and Examples #922 @pi314ever
- OpenCLIP sample for visual question answering #977 @vidyasiv
- Enabled DETR (Object Detection) model #1046 @cfgfung
- Table transformer enabling #978 @pi314ever
- deciLM support #1133 @sywangyi
Stable Diffusion inpainting, unconditional image generation
- Add the Stable diffusion inpaint support #869 @yuanwu2017
- Enable Unconditional Image Generation on Gaudi 2 [Diffuser/Tasks] #859 @cfgfung
Text feature extraction example
- Feature extraction enabling #994 @pi314ever
Tensor parallelism
- Tensor parallel distributed strategy without using deepspeed #1121 @kalyanjk
- Disable torch.compile for all_reduce when parallel_strategy is set to "tp" #1174 @kalyanjk
Kubernetes cluster example
- Adds a helm chart, dockerfile, and instructions for running examples using a Kubernetes cluster #1099 @dmsuehir
- Fix PyTorch version in the Kubernetes docker-compose to match image #1246 @dmsuehir
FP8 training
- TE FP8 integration #1096 @sanjucsudhakaran
Other
- Updates run_lora_clm.py with enhanced dataset support #955 @dmsuehir
- Fix prefix tuning finetune issue and update test #975 @sywangyi
- Fix throughput calculation in image-to-text example #1070 @regisss
- SDXL-trainig: fixed ci, changed gated dataset, fixes for non-square datasets #1038 @imangohari1
- Updating batch_size of Albert-XXL in README #1063 @vineethanandh
- Fix the error of running run_pipeline.py of text_generation example #1055 @yuanwu2017
- Add a test for llama finetuning with FP8 precision #1106 @sanjucsudhakaran
- Beam-search fix #1113 @ssarkar2
- Add chat format support dataset in SFT #1066 @libinta
- Fix nan loss of gemma and crash if dataset_concatenation is not set #1088 @sywangyi
- torch.compile: keep input mutation in graph, avoiding unnecessary memcpy #1069 @sushildubey171
- Updated langchain text-generation pipeline to work with latest release 0.2.5 #1084 @rbrugaro
- Add the MC example #891 @yuanwu2017
- Fix recompiles if limit_hpu_graph is False #1129 @ssarkar2
- Update examples batchsize in README #1123 @shepark
- Fix OOM error in SDXL Fine-Tuning validation stage #1134 @dsocek
- Added an example code to demonstrate how to use deterministic image generation #878 @cfgfung
- SD image variation/InstructPix2Pix/StableDiffusionXLImg2ImgPipeline pipeline #988 @sywangyi
- Add ci test for trl rewarding and ppo, fix backward failure in ppo caused by rmsfusion #1020 @sywangyi
- Llama adapter #983 @sywangyi
- torch.flip issue is fixed in SynapseAI 1.16, so remove the WA #1092 @sywangyi
- Fix test CausalLanguageModelingLORAExampleTester KeyError #1139 @dmsuehir
- fix(ci): new runs-on #1136 @XciD
- Add trust_remote_code for loading datasets in the audio classification example #1074 @regisss
- Generation example: print number of warmup iterations #1145 @mgonchar
- CI Updates: text-gen to receive ranks/bs, Updated bs/metric for baselines #1140 @imangohari1
- Support for custom files for run_lora_clm.py #1039 @vidyasiv
- Change the device_id for FSDP plugin #1086 @ckvermaAI
- Set KV Cache update as static method #1160 @ulivne
- To fix CPU tensor issue #1157 @mkumargarg
- Adding missing __init__.py to mistral and mixtral test package #1188 @rkumar2patel
- Add example of multitask_prompt/poly tuning #915 @sywangyi
- Fix data-type mismatch for mlperf_inference accuracy test #1146 @kalyanjk
- Fix spawn MP context, limit cpu and download data #1131 @polisettyvarma
- T5 multi card #1222 @yafshar
- Add trust_remote_code for t5 poly-tuning test #1220 @yafshar
- Resolve "empty tensor optional" error with hpu_graphs + kv cache for StarCoder #1181 @vidyasiv
- Fix VIT, add wav2vec comment #1223 @ssarkar2
- Roberta tests were running on CPU #1229 @ssarkar2
- Fix bert/roberta contrastive search tests #1226 @skavulya
- Remove the default env variable to trust remote code by default #1225 @yafshar
- Improve style check workflow #1230 @regisss
- Added scheduler selection for SDXL fine-tuning #867 @kplau1128
- Clear help msg for ignore_eos to avoid misunderstanding @sywangyi
- Support loading hugging face checkpoint #1165 @ulivne
- Change triggering event for code style check #1238 @regisss
- gptj: fix missing token_idx #1234 @envsp
- fix(nltk): fixed the version to working one #1247 @imangohari1
- Updating to avoid hardcoding tests in CI framework #1221 @vidyasiv
- Fix FSDP graph error due to Transformers 4.43 update #1251 @jiminha
- Fix SD README commands #1250 @imangohari1
- Fix spelling errors #1252 @changwangss
- Set HLS_MODULE_ID only if it wasn't set previously #1254 @astachowiczhabana
- Fix overflow of steps in SDXL for default diffusers scheduler @dsocek
- fix(test_diffusers): automated the checking for tests without upstream HF #1232 @imangohari1
- fix(nltk): Revert 1247. Updated the version. added the punkt_tab download #1258 @imangohari1
- Set input_embeds before it gets used #1261 @tthakkal
- Update README and more changes, rebase to main #1259 @shepark
Known limitations
- For Llama, some big batch sizes lead to out-of-memory errors whereas they used to work
v1.12.1: Patch Release
Fix 1st token latency time measure
Fix for Mixtral
- Mixtral typo fix #1107 @schoi-habana
Other
Full Changelog: v1.12.0...v1.12.1
v1.12: Qwen2, Gemma, SVD, Dreambooth, speculative sampling
SynapseAI v1.16
Transformers 4.40
Speculative Sampling
- Speculative sampling on Gaudi using Optimum-Habana #973 @nraste
- Fix assisted decoding generation error #1080 @libinta
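Speculative (assisted) decoding pairs the target model with a small draft model through the standard Transformers API; a sketch on HPU (models illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()
tok = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b", torch_dtype=torch.bfloat16).to("hpu")
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.bfloat16).to("hpu")

inputs = tok("The theory of relativity states", return_tensors="pt").to("hpu")
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```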
Model optimizations
- Add --bucket_size support for gpt_bigcode #802 @jiminha
- Optimize StableLM model inference #805 @XinyuYe-Intel
- Enable google/gemma-7b. #747 @lkk12014402
- Enable llava static generation. #767 @lkk12014402
- Fix perf drop in flan-t5 summarization #908 @MohitIntel
- Enable Qwen2 model #774 @XinyuYe-Intel
- Extend bucket_internal to SAMPLE generation mode #819 @xt574chen
- SpeechT5 static consistent dropout #824 @Spycsh
- Optimize inference of Persimmon model #822 @XinyuYe-Intel
- Enable OWL-ViT graph mode on Gaudi platform #783 @cfgfung
- Support mixtral kvcache reuse and remove kv_cache_fp8 #898 @jychen21
- Add fp8 related changes to mistral for text-generation #918 @skaulintel
- Optimization for phi series models: support fp8 kv cache and reuse kv cache #902 @yuwenzho
- Support Mistral 32K input token #931 @jiminha
- Support mixtral long sequence 32k with bs 4 #903 @jychen21
- Adapt Mixtral long sequence handling for Mistral #985 @jiminha
- Fix performance issue in mistral #1030 @jiminha
- Optimized inference of Starcoder2 model #829 @XinyuYe-Intel
- Add support for IBM Granite #1045 @regisss
- Enable fp8 inference for Llava-hf 7B and 13B in 1.16 release #951 @Luca-Calabria
- Fusedrope inp bf16 #1026 @ssarkar2
- Enhance Qwen2 model with FSDPA and bucket #1033 @Zhiwei35
- Optimize seamless-m4t/vits model for text-to-speech generation #825 @sywangyi
- cache_optimization #1028 @ssarkar2
- Ensure KV cache is not returned as output tensor during decode phase for Falcon #993 @schoi-habana
- Fast softmax #972 @wszczurekhabana
- Falcon optimization #974 @libinta
- Quantization for FSDPA #976 @dudilester
- Falcon update park #1052 @ssarkar2
- Add the Llava_next support #1041 @yuanwu2017
- Improve torch compile performance #1082 @libinta
Stable Video Diffusion
PEFT
- Add ia3 and adalora support #809 @sywangyi
- Enable prompt tuning/prefix tuning/p tuning clm and example #758 @sywangyi
TRL
Object Segmentation Example
Dreambooth
Others
- Text generation pipeline: Extended functionality to align with run_generation script #782 @mgonchar
- Enable clip mediapipe and update G2 baseline #856 @MohitIntel
- Add ci test for SFT and DPO #857 @sywangyi
- Fix SFT, DPO CI on Gaudi1 #893 @regisss
- Add SDXL in README #894 @regisss
- Fix falcon 180b oom issue if peft > 0.6.2 #895 @sywangyi
- Enabled additional models in CI #879 @MohitIntel
- Add static shape support for vision_encoder_decoder generation if decoder supports static shape #834 @sywangyi
- Add HabanaProfile to Stable Diffusion and XL #828 @atakaha
- Pytest accuracy updates for Falcon, T5, GPT2 #916 @Luca-Calabria
- Update text-generation readme with torch.compile info. #884 @libinta
- Update Wav2Vec2ModelTest::test_initialization #919 @malkomes
- Add linear and dynamic RoPE to Mistral and Mixtral #892 @regisss
- Fix for wav2vec2 test cases #923 @lqnguyen
- Add nograd() to prevent backward backend #897 @astachowiczhabana
- Assisted decoding not implemented #910 @tjs-intel
- Disable wav2vec2 symbolic tracing test #904 @tjs-intel
- Add support for symbolic tracing of GPT2 models #913 @tjs-intel
- Utils: return a more reasonable error when attempting to load a non-PyTorch model #921 @mgonchar
- Pytest accuracy updates for Bridgetower, Swin, Vit #927 @Luca-Calabria
- Text generation: added langchain pipeline script #887 @mgonchar
- Fix for AST models #914 @vidyasiv
- Fix AttributeError for wav2vec test #929 @Jianhong-Zhang
- Fix ValueError for test_summarization #939 @Jianhong-Zhang
- Grad norm tensor fix #938 @yeonsily
- Add information to the audio-classification examples README about --ddp_find_unused_parameters parameter #941 @Alberto-Villarreal
- Add leaderboard link #947 @echarlaix
- Fix formatting of arg parse help strings in the PEFT example #944 @dmsuehir
- Use new Habana llama and falcon model configs #940 @skaulintel
- Update based on legal requirements. #900 @libinta
- Update test generation config to raise ValueError #949 @malkomes
- Add --trust_remote_code for text generation examples #870 @yangulei
- Added Llama-2 fp8 text-generation test cases #934 @yeonsily
- Upgrade SD output image verification with CLIP score #920 @MohitIntel
- Llama Guard for text classification example #871 @dsmertin
- Update README logo #950 @regisss
- Add Gaudi CI for Sentence Transformers #928 @regisss
- Get iteration times through generate() #899 @hsubramony
- Update speech recognition seq2seq example #953 @regisss
- Fix wrongly all_gather for mixtral finetune #965 @ccrhx4
- Add intel-mila protST example #860 @sywangyi
- Small CI refacto #968 @regisss
- Llama70b one card to infer device map with max memory limitation #963 @Yantom1
- Map list to tensors #926 @ssarkar2
- Fix fsdp lora torch compile issue #971 @sywangyi
- Fix for the simulate_dyn_prompt flag assertion #984 @alekseyfa
- Initial enablement with FP8 Training (port from OHF #91) #936 @libinta
- Warn user when using --disk_offload without hqt #964 @Yantom1
- Assign grad_norm for logging only if it's a single element tensor #992 @yeonsily
- Update examples #998 @regisss
- Fix warmup for diffusers when batch size < throughput_warmup_steps #960 @dsocek
- Add torch.compile instructions for Roberta-Large #981 @MohitIntel
- Fix gpt_neox, stablelm inference regression caused by RoPE dtype #999 @mandy-li
- fea(examples): Updated the READMEs with requirements.txt installation #1000 @imangohari1
- Initial commit for fp8 CI #995 @yeonsily
- Fixed 'MixtralConfig' object has no attribute 'rope_scaling' #1009 @aslanxie
- Use the length of timesteps as the number of inference steps #986 @yuanwu2017
- Fix the bug of output_type=np or latent. #996 @yuanwu2017
- Fix wav2vec test load adapter #937 @malkomes
- Mark scale as const and remove --fp8 flag usage #962 @Yantom1
- Add per step time collection to other methods #1004 @ssarkar2
- Fix first token time #1019 @ssarkar2
- Fix text-generation example #1025 @regisss
- Updates test_beam_search to transformers_4.40 #1017 @malkomes
- Fix eos problem #1034 @sywangyi
- fp8 textgen ci structure update #1029 @jiminha
- Fix a return value issue caused by PR 973 #1040 @yafshar
- Add no_checks for sub dataset in lvwerra/stack-exchange-paired since it does not contain test split #1003 @sywangyi
- Readme Update for FSDP #980 @hlahkar
- Add unifier script and disk offload flag usages to README. #1023 @libinta
- Add mixtral for meta device load due to mixtral-8x22b model size #909 @libinta
- Update unifier script #1010 @Yantom1
- Update text-generation CI configuration for falcon and Mixtral #1044 @yeonsily
- Update multi-node README to check ssh connection issue #1048 @yeonsily
- Infra upgrade workflows #480 @glegendre01
- Update test_text_generation_example.py #1051 @ssarkar2
- BERT training migrated to torch.compile #990 @ANSHUMAN87
- Update test_examples.py #1053 @ssarkar2
- Update modeling_llama.py: deepspeed fix for codellama #1054 @ssarkar2
- No shapes in profilings by default #1050 @astachowiczhabana
- Change the way to unset environment variable for gpt-neox ci #1060 @yeonsily
- Update README for Albert torch.compile mode #1061 @MohitIntel
- Fix lm_evaluation_harness to specific commit (#240) #1064 @astachowiczhabana
- Fix text-generation example README.md #1081 @shepark